Overview

Crosstable is a package centered on a single function, crosstable, which easily computes descriptive statistics on datasets.

It supports the tidyselect syntax for selecting variables (and more), and interfaces with the officer package to create automated reports.

Installation

install.packages("devtools")
devtools::install_github("DanChaltiel/crosstable", build_vignettes=TRUE)

If you run into any installation problem, try reading the wiki or file an issue.

Getting help

You can use the package vignettes.

Usage

Base usage

With no argument other than the dataset, the function summarises all numeric variables with statistics (min+max, mean+sd, median+IQR, N+NA) and all categorical variables with counts and percentages.

library(crosstable)
library(dplyr) #for the pipe

crosstable(iris)
#>             .id        label   variable         value
#> 1  Sepal.Length Sepal.Length  Min / Max     4.3 / 7.9
#> 2  Sepal.Length Sepal.Length  Med [IQR] 5.8 [5.1;6.4]
#> 3  Sepal.Length Sepal.Length Mean (std)     5.8 (0.8)
#> 4  Sepal.Length Sepal.Length     N (NA)       150 (0)
#> 5   Sepal.Width  Sepal.Width  Min / Max     2.0 / 4.4
#> 6   Sepal.Width  Sepal.Width  Med [IQR] 3.0 [2.8;3.3]
#> 7   Sepal.Width  Sepal.Width Mean (std)     3.1 (0.4)
#> 8   Sepal.Width  Sepal.Width     N (NA)       150 (0)
#> 9  Petal.Length Petal.Length  Min / Max     1.0 / 6.9
#> 10 Petal.Length Petal.Length  Med [IQR] 4.3 [1.6;5.1]
#> 11 Petal.Length Petal.Length Mean (std)     3.8 (1.8)
#> 12 Petal.Length Petal.Length     N (NA)       150 (0)
#> 13  Petal.Width  Petal.Width  Min / Max     0.1 / 2.5
#> 14  Petal.Width  Petal.Width  Med [IQR] 1.3 [0.3;1.8]
#> 15  Petal.Width  Petal.Width Mean (std)     1.2 (0.8)
#> 16  Petal.Width  Petal.Width     N (NA)       150 (0)
#> 17      Species      Species     setosa   50 (33.33%)
#> 18      Species      Species versicolor   50 (33.33%)
#> 19      Species      Species  virginica   50 (33.33%)
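
To make the default statistics concrete, here is a minimal base R sketch that recomputes the Sepal.Length and Species rows shown above. It only illustrates what each figure means; it is not crosstable's own code, and the names x, num_summary, counts and pct are made up for this example.

```r
# Base R sketch of the default summaries (illustration only, not crosstable internals)
x <- iris$Sepal.Length

num_summary <- c(
  min    = min(x),
  max    = max(x),
  median = median(x),
  q25    = unname(quantile(x, 0.25)),  # lower bound of the IQR
  q75    = unname(quantile(x, 0.75)),  # upper bound of the IQR
  mean   = mean(x),
  sd     = sd(x),
  n      = sum(!is.na(x)),             # non-missing observations
  n_na   = sum(is.na(x))               # missing observations
)
print(round(num_summary, 1))

# Counts and percentages for a categorical variable
counts <- table(iris$Species)
pct <- round(100 * prop.table(counts), 2)
print(data.frame(level = names(counts), n = as.vector(counts), pct = as.vector(pct)))
```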

Column specification, grouping and labels

You can select specific columns using names and helper functions, and request specific summary statistics using funs and funs_arg. The by argument lets you specify a grouping variable. Here, as the mtcars2 dataset has labels, they are also included in the crosstable.

The as_flextable function outputs a beautiful HTML table that can be customized at will (see the flextable package) and embedded in a Word document (see the officer package).

library(tidyverse)
ct1 = crosstable(mtcars2, qsec, ends_with("t"), starts_with("c"), by=vs,
                 funs=c(mean, quantile), funs_arg=list(probs=c(.25,.75), digits=3))
ct1 %>% as_flextable(keep_id=TRUE)

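| .id  | label                 | variable     | Engine: straight | Engine: vshaped |
|------|-----------------------|--------------|------------------|-----------------|
| qsec | 1/4 mile time         | mean         | 19.334           | 16.694          |
|      |                       | quantile 25% | 18.602           | 15.995          |
|      |                       | quantile 75% | 19.975           | 17.415          |
| drat | Rear axle ratio       | mean         | 3.859            | 3.392           |
|      |                       | quantile 25% | 3.718            | 3.070           |
|      |                       | quantile 75% | 4.080            | 3.702           |
| wt   | Weight (1000 lbs)     | mean         | 2.611            | 3.689           |
|      |                       | quantile 25% | 2.001            | 3.236           |
|      |                       | quantile 75% | 3.209            | 3.844           |
| cyl  | Number of cylinders   | 4            | 10 (90.91%)      | 1 (9.09%)       |
|      |                       | 6            | 4 (57.14%)       | 3 (42.86%)      |
|      |                       | 8            | 0 (0%)           | 14 (100.00%)    |
| carb | Number of carburetors | mean         | 1.786            | 3.611           |
|      |                       | quantile 25% | 1.000            | 2.250           |
|      |                       | quantile 75% | 2.000            | 4.000           |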
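
As a rough illustration of what funs and funs_arg do, the sketch below applies mean and quantile per group in base R and forwards probs to quantile. It uses the base mtcars dataset, where vs is coded 0 (V-shaped) and 1 (straight), rather than mtcars2; by_engine and qsec_stats are names invented for this example, and crosstable's internals may differ.

```r
# Base R sketch: per-group mean and quantiles of qsec (illustration only)
by_engine <- split(mtcars$qsec, mtcars$vs)  # list with elements "0" and "1"

qsec_stats <- sapply(by_engine, function(x) {
  # probs plays the role of the extra argument passed through funs_arg
  c(mean = mean(x), quantile(x, probs = c(0.25, 0.75)))
})

# Rows: mean, 25%, 75%; columns: "0" (V-shaped) and "1" (straight)
print(round(qsec_stats, 3))
```

The column "1" should line up with the "straight" column of the qsec rows above, and the column "0" with "vshaped".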

Margins and totals

The margin argument changes the percentages calculation, while the total argument adds total rows or columns.

#margin and totals
ct2 = crosstable(mtcars2, disp, vs, by=am, margin=c("row", "col"), total="both")
ct2 %>% as_flextable

| label                 | variable   | Transmission: auto   | Transmission: manual | Total               |
|-----------------------|------------|----------------------|----------------------|---------------------|
| Displacement (cu.in.) | Min / Max  | 120.1 / 472.0        | 71.1 / 351.0         | 71.1 / 472.0        |
|                       | Med [IQR]  | 275.8 [196.3;360.0]  | 120.3 [79.0;160.0]   | 196.3 [120.8;326.0] |
|                       | Mean (std) | 290.4 (110.2)        | 143.5 (87.2)         | 230.7 (123.9)       |
|                       | N (NA)     | 19 (0)               | 13 (0)               | 32 (0)              |
| Engine                | straight   | 7 (50.00% / 36.84%)  | 7 (50.00% / 53.85%)  | 14 (43.75%)         |
|                       | vshaped    | 12 (66.67% / 63.16%) | 6 (33.33% / 46.15%)  | 18 (56.25%)         |
|                       | Total      | 19 (59.38%)          | 13 (40.62%)          | 32 (100.00%)        |
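
The two percentages in each Engine cell are the row percentage followed by the column percentage. The base R sketch below recomputes them from the base mtcars dataset (vs: 0 = V-shaped, 1 = straight; am: 0 = automatic, 1 = manual). The factor labels and the names tab, row_pct and col_pct are chosen for this example only, and this is not how crosstable computes them internally.

```r
# Base R sketch of row/column percentages and totals (illustration only)
engine <- factor(mtcars$vs, levels = c(1, 0), labels = c("straight", "vshaped"))
transmission <- factor(mtcars$am, levels = c(0, 1), labels = c("auto", "manual"))
tab <- table(engine, transmission)

row_pct <- round(100 * prop.table(tab, margin = 1), 2)  # each row sums to 100
col_pct <- round(100 * prop.table(tab, margin = 2), 2)  # each column sums to 100

print(addmargins(tab))  # counts with a "Sum" row and column, similar to total="both"
print(row_pct)
print(col_pct)
```

For instance, the "straight / auto" cell holds 7 cars, which is 50% of the 14 straight engines (row) and about 36.84% of the 19 automatic cars (column).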

Predicate functions, automatic testing

For the variable selection, you can use predicate functions. It is good practice to wrap these in where. If the grouping variable is numeric, correlation coefficients will be calculated.

Using the test argument, you can test the association between each variable and the grouping variable. Beware: automatic testing should only be done in an exploratory context, as it would otherwise cause extensive alpha inflation.

ct3 = crosstable(mtcars2, where(is.numeric), by=hp, test=TRUE)
ct3 %>% as_flextable

| label                 | variable | Gross horsepower          | test                                                      |
|-----------------------|----------|---------------------------|-----------------------------------------------------------|
| Miles/(US) gallon     | pearson  | -0.78 95%CI [-0.89;-0.59] | p value: <0.0001 (Pearson’s product-moment correlation)   |
| Displacement (cu.in.) | pearson  | 0.79 95%CI [0.61;0.89]    | p value: <0.0001 (Pearson’s product-moment correlation)   |
| Rear axle ratio       | pearson  | -0.45 95%CI [-0.69;-0.12] | p value: 0.0100 (Pearson’s product-moment correlation)    |
| Weight (1000 lbs)     | pearson  | 0.66 95%CI [0.4;0.82]     | p value: <0.0001 (Pearson’s product-moment correlation)   |
| 1/4 mile time         | pearson  | -0.71 95%CI [-0.85;-0.48] | p value: <0.0001 (Pearson’s product-moment correlation)   |
| Number of carburetors | pearson  | 0.75 95%CI [0.54;0.87]    | p value: <0.0001 (Pearson’s product-moment correlation)   |
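
The first row of the table can be cross-checked in base R with cor.test, and the alpha inflation warning can be put into numbers. The sketch below uses the base mtcars dataset; the familywise error rate formula assumes independent tests, which is a simplification (the six tests above share the same grouping variable), and the names ct, alpha, m and fwer are chosen for this example.

```r
# Base R sketch: Pearson correlation between horsepower and miles per gallon
ct <- cor.test(mtcars$hp, mtcars$mpg, method = "pearson")
print(round(unname(ct$estimate), 2))  # correlation coefficient
print(round(ct$conf.int, 2))          # 95% confidence interval
print(ct$p.value)

# Familywise error rate for m tests at level alpha,
# under the simplifying assumption that the tests are independent
alpha <- 0.05
m <- 6  # six variables are tested in the table above
fwer <- 1 - (1 - alpha)^m
print(round(fwer, 3))  # about 0.265, far above the nominal 0.05
```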

Lambda functions, effect size calculation

The predicate function can be a lambda function, using .x as the variable name.

Using the effect argument, you can calculate effect sizes for all numeric variables and for categorical variables with exactly 2 levels.

ct4 = crosstable(mtcars2, where(~is.numeric(.x) && mean(.x)>50), by=vs, effect=TRUE)
ct4 %>% as_flextable

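| label                 | variable   | Engine: straight   | Engine: vshaped     | effect                                                                                        |
|-----------------------|------------|--------------------|---------------------|-----------------------------------------------------------------------------------------------|
| Displacement (cu.in.) | Min / Max  | 71.1 / 258.0       | 120.3 / 472.0       | Difference in means (Welch CI) (straight minus vshaped): -174.69 CI95%[-235.02 to -114.36]    |
|                       | Med [IQR]  | 120.5 [83.0;162.4] | 311.0 [275.8;360.0] |                                                                                               |
|                       | Mean (std) | 132.5 (56.9)       | 307.1 (106.8)       |                                                                                               |
|                       | N (NA)     | 14 (0)             | 18 (0)              |                                                                                               |
| Gross horsepower      | Min / Max  | 52.0 / 123.0       | 91.0 / 335.0        | Difference in means (Welch CI) (straight minus vshaped): -98.37 CI95%[-130.67 to -66.06]      |
|                       | Med [IQR]  | 96.0 [66.0;109.8]  | 180.0 [156.2;226.2] |                                                                                               |
|                       | Mean (std) | 91.4 (24.4)        | 189.7 (60.3)        |                                                                                               |
|                       | N (NA)     | 14 (0)             | 18 (0)              |                                                                                               |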
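
For intuition, the same selection and effect size can be approximated in base R: Filter plays the role of the lambda predicate, and t.test with var.equal = FALSE gives a Welch confidence interval for the difference in means. This sketch uses the base mtcars dataset (vs: 0 = V-shaped, 1 = straight) and invented names (selected, straight, vshaped, welch, diff_means); whether crosstable relies on t.test internally is an assumption.

```r
# Keep numeric columns whose mean exceeds 50 (base R analogue of the lambda predicate)
selected <- Filter(function(x) is.numeric(x) && mean(x) > 50, mtcars)
print(names(selected))

# Welch two-sample t-test on displacement: straight (vs == 1) minus V-shaped (vs == 0)
straight <- mtcars$disp[mtcars$vs == 1]
vshaped  <- mtcars$disp[mtcars$vs == 0]
welch <- t.test(straight, vshaped, var.equal = FALSE)

diff_means <- mean(straight) - mean(vshaped)
print(round(diff_means, 2))      # point estimate of the difference in means
print(round(welch$conf.int, 2))  # should be close to the interval in the table above
```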

Formula syntax, survival variables

Finally, you can describe survival data using the Surv object from the package survival. The times and followup arguments allow for more control.

This is only possible using the formula syntax of variable selection, which allows more complex selection and is written as var1 + var2 ~ group.

library(survival)
ct5 = crosstable(aml, Surv(time, status) ~ x, times=c(0,15,30,150), followup=TRUE)
ct5 %>% as_flextable

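| label              | variable                     | x: Maintained  | x: Nonmaintained |
|--------------------|------------------------------|----------------|------------------|
| Surv(time, status) | t=0                          | 1.00 (0/11)    | 1.00 (0/12)      |
|                    | t=15                         | 0.82 (2/8)     | 0.58 (5/7)       |
|                    | t=30                         | 0.61 (2/5)     | 0.29 (3/4)       |
|                    | t=150                        | 0.18 (3/1)     | 0 (3/0)          |
|                    | Median follow up [min ; max] | 103 [13 ; 161] | NA [16 ; 45]     |
|                    | Median survival              | 31             | 23               |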
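
The survival rows can be read as Kaplan-Meier estimates followed by (events / at risk). A minimal sketch with the survival package alone, shown here only to clarify what the cells contain (fit, s and km are names chosen for this example, and crosstable's internals may differ):

```r
library(survival)

# Kaplan-Meier fit per maintenance group (aml ships with the survival package)
fit <- survfit(Surv(time, status) ~ x, data = aml)

# Survival probabilities at the requested times; extend = TRUE keeps
# time points beyond the last observation of a group (here t = 150)
s <- summary(fit, times = c(0, 15, 30, 150), extend = TRUE)
km <- data.frame(
  group   = s$strata,
  time    = s$time,
  surv    = round(s$surv, 2),
  n_event = s$n.event,  # events since the previous requested time
  n_risk  = s$n.risk    # subjects still at risk at that time
)
print(km)

# Median survival per group
print(summary(fit)$table[, "median"])
```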

Acknowledgement

crosstable is a rewrite of the awesome biostat2 package written by David Hajage. The user interface is quite different but the concept is the same.

Thanks David!